
engine: keep SimpleEngine serialized across cancellation #8

Merged

krystophny merged 1 commit into main from fix/simple-engine-cancel-serialization on Mar 24, 2026
Conversation

@krystophny
Collaborator

Summary

  • keep the SimpleEngine generation lock held until cancelled asyncio.to_thread(...) workers actually finish
  • cover the race with a focused regression test

Reproducer

FortBench hit this on Apple Silicon during local Qwen runs:

  • a non-streaming /v1/responses request was cancelled by the disconnect guard
  • the request task exited, but the blocking MLX worker thread was still running
  • the next request entered Metal before the first worker drained
  • MLX/Metal aborted with a failed assertion: `A command encoder is already encoding to this command buffer`
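The hazard is independent of MLX and easy to demonstrate: cancelling a task that awaits asyncio.to_thread(...) returns control immediately while the worker thread keeps running. A minimal, self-contained sketch (blocking() is a stand-in for the MLX generation call):

```python
import asyncio
import time


async def main():
    done = []

    def blocking():
        time.sleep(0.2)  # stand-in for the blocking MLX generation call
        done.append(True)

    task = asyncio.create_task(asyncio.to_thread(blocking))
    await asyncio.sleep(0.05)  # let the worker thread start
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    print("thread done at cancellation?", bool(done))  # False: still running
    await asyncio.sleep(0.3)
    print("thread done eventually?", bool(done))       # True: outlived the task


asyncio.run(main())
```

Any lock released at that first CancelledError can therefore be re-acquired while the original thread is still inside Metal, which is exactly the window the second request hit.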

Fix

SimpleEngine now routes blocking MLX calls through a cancellation-safe helper that:

  • acquires _generation_lock
  • runs the blocking call in asyncio.to_thread(...)
  • on cancellation, waits for that worker task to finish before releasing the lock

This keeps MLX/Metal single-threaded even when the client disconnects.
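A condensed sketch of the helper's shape (not the exact code from this PR: the asyncio.shield() here is one way to keep the worker task alive through the drain, since a bare await on a cancelled-through task returns immediately, and the BaseException catch mirrors a follow-up commit referenced at the end of this thread):

```python
import asyncio


class SimpleEngine:
    def __init__(self) -> None:
        self._generation_lock = asyncio.Lock()

    async def _run_blocking_serialized(self, fn, *args):
        # Hold the lock for the full lifetime of the worker thread,
        # not just until the awaiting request is cancelled.
        async with self._generation_lock:
            task = asyncio.create_task(asyncio.to_thread(fn, *args))
            try:
                # shield() so cancelling this coroutine does not mark the
                # worker task cancelled while its thread is still running.
                return await asyncio.shield(task)
            except asyncio.CancelledError:
                # Drain: wait for the worker thread to finish before the
                # lock is released, so the next request cannot overlap
                # with it inside Metal.
                try:
                    await task
                except BaseException:
                    pass  # a later commit addresses logging drain failures
                raise
```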

Tests

  • python3 -m unittest tests.test_simple_engine_cancel_serialization -v
  • python3 -m compileall vllm_mlx tests

@krystophny krystophny merged commit 892e5bf into main Mar 24, 2026
@krystophny krystophny deleted the fix/simple-engine-cancel-serialization branch March 24, 2026 14:34
@qodo-code-review

Review Summary by Qodo

Keep SimpleEngine serialized across request cancellation

🐞 Bug fix


Walkthroughs

Description
• Introduce _run_blocking_serialized() helper to prevent MLX/Metal race conditions
• Ensure generation lock held until worker thread completes, even on cancellation
• Replace direct asyncio.to_thread() calls with cancellation-safe wrapper
• Add regression test verifying no concurrent MLX access after request cancellation
Diagram
flowchart LR
  A["Request cancelled"] --> B["_run_blocking_serialized acquires lock"]
  B --> C["asyncio.to_thread starts worker"]
  C --> D["Cancellation caught"]
  D --> E["Wait for worker to finish"]
  E --> F["Release lock safely"]
  F --> G["Next request enters MLX"]


File Changes

1. tests/test_simple_engine_cancel_serialization.py 🧪 Tests +80/-0

Regression test for cancellation-safe serialization

• New regression test file with SimpleEngineCancelSerializationTests class
• Tests that cancellation does not release generation lock before worker finishes
• Verifies concurrent MLX access counter stays at 1 (no overlap)
• Uses threading events and mocks to simulate concurrent generate calls

tests/test_simple_engine_cancel_serialization.py


2. vllm_mlx/engine/simple.py 🐞 Bug fix +103/-93

Implement cancellation-safe serialized blocking operations

• Add new _run_blocking_serialized() method that wraps blocking calls with cancellation safety
• Method acquires _generation_lock, runs blocking operation in thread, and waits for completion on cancellation
• Refactor generate() to use _run_blocking_serialized() instead of direct asyncio.to_thread()
• Refactor chat() to use _run_blocking_serialized() for both MLLM and LLM paths
• Refactor stream_chat() MLLM path to use _run_blocking_serialized() and move lock outside async context
• Refactor stream_generate() to use _run_blocking_serialized() for all worker calls

vllm_mlx/engine/simple.py



@qodo-code-review

qodo-code-review Bot commented Mar 24, 2026

Code Review by Qodo

🐞 Bugs (3) 📘 Rule violations (0) 📎 Requirement gaps (0) 📐 Spec deviations (0)



Action required

1. MTP stream deadlock 🐞 Bug ⛯ Reliability
Description
_stream_generate_text() holds _generation_lock and now awaits _run_blocking_serialized(),
which tries to acquire the same lock again, deadlocking MLLM+MTP text-only streaming. Text-only MLLM
requests routed to this path will hang indefinitely.
Code

vllm_mlx/engine/simple.py[1199]

+            all_resps = await self._run_blocking_serialized(_run_all)
Evidence
_run_blocking_serialized() acquires _generation_lock unconditionally. _stream_generate_text() wraps its generation body in `async with self._generation_lock:` and later calls `all_resps = await self._run_blocking_serialized(_run_all)` within that block, causing nested acquisition of the same asyncio.Lock and a deadlock.

vllm_mlx/engine/simple.py[229-246]
vllm_mlx/engine/simple.py[1012-1014]
vllm_mlx/engine/simple.py[1194-1201]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`_stream_generate_text()` currently acquires `self._generation_lock` and then calls `_run_blocking_serialized()`, which also acquires `self._generation_lock`. This deadlocks.

### Issue Context
This path is used for MLLM+MTP routing (text-only requests). The PR moved lock acquisition into `_run_blocking_serialized()`, so call sites that already hold the lock must be adjusted.

### Fix Focus Areas
- vllm_mlx/engine/simple.py[229-246]
- vllm_mlx/engine/simple.py[1012-1014]
- vllm_mlx/engine/simple.py[1194-1201]

### Implementation direction
Refactor `_stream_generate_text()` so `_run_all` (and thus MLX/Metal calls) run under `_run_blocking_serialized()` without an additional surrounding `async with self._generation_lock:` block; or add a dedicated cancellation-safe helper that does **not** reacquire the lock when the caller already holds it.
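One possible shape for that second option: a hypothetical unlocked twin of the helper for call sites already inside `async with self._generation_lock:` (the name and the shield usage are illustrative, not code from this PR):

```python
    async def _run_blocking_unlocked(self, fn, *args):
        """Hypothetical twin of _run_blocking_serialized for callers that
        already hold self._generation_lock: same drain-on-cancel behavior,
        no lock acquisition."""
        task = asyncio.create_task(asyncio.to_thread(fn, *args))
        try:
            return await asyncio.shield(task)
        except asyncio.CancelledError:
            try:
                await task  # wait out the worker thread before propagating
            except BaseException:
                pass
            raise
```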




Remediation recommended

2. Flaky cancellation test 🐞 Bug ⛯ Reliability
Description
The new regression test doesn’t assert that first_started.wait() succeeded, so it can proceed even
if the first worker never started, weakening the test and making failures/pass conditions
misleading. This can cause false positives or intermittent behavior under scheduler variance.
Code

tests/test_simple_engine_cancel_serialization.py[R59-63]

+            await asyncio.to_thread(first_started.wait, 1.0)
+
+            task1.cancel()
+            task2 = asyncio.create_task(engine.generate(prompt="second", max_tokens=8))
+
Evidence
threading.Event.wait() returns a boolean indicating whether it was set before timeout, but the
test ignores it and immediately cancels task1 and starts task2. If the first worker didn’t start
within the timeout, the test is no longer validating the intended race.

tests/test_simple_engine_cancel_serialization.py[58-63]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
The test should fail fast if the first worker never started; otherwise it may test a different scenario than intended.

### Issue Context
`threading.Event.wait(timeout)` returns `True/False`. The test currently ignores this return value.

### Fix Focus Areas
- tests/test_simple_engine_cancel_serialization.py[58-63]

### Implementation direction
Capture the boolean result and assert it is `True`, e.g.:
- `started = await asyncio.to_thread(first_started.wait, 1.0)`
- `self.assertTrue(started, "first worker did not start in time")`
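Folding that assertion into the test's overall shape, a condensed, self-contained sketch (the stand-in engine and all names and timeouts are illustrative, not the contents of the actual test file):

```python
import asyncio
import threading
import unittest


class _FakeEngine:
    """Stand-in exposing only the cancellation-safe helper under test."""

    def __init__(self) -> None:
        self._generation_lock = asyncio.Lock()

    async def _run_blocking_serialized(self, fn, *args):
        async with self._generation_lock:
            task = asyncio.create_task(asyncio.to_thread(fn, *args))
            try:
                return await asyncio.shield(task)
            except asyncio.CancelledError:
                try:
                    await task  # drain the worker thread before releasing
                except BaseException:
                    pass
                raise


class SimpleEngineCancelSerializationTests(unittest.IsolatedAsyncioTestCase):
    async def test_cancellation_does_not_allow_overlap(self):
        engine = _FakeEngine()
        active, max_active = 0, 0
        guard = threading.Lock()
        first_started, release_first = threading.Event(), threading.Event()

        def worker(is_first):
            nonlocal active, max_active
            with guard:
                active += 1
                max_active = max(max_active, active)
            if is_first:
                first_started.set()
                release_first.wait(1.0)  # hold the fake "MLX call" open
            with guard:
                active -= 1

        task1 = asyncio.create_task(engine._run_blocking_serialized(worker, True))
        started = await asyncio.to_thread(first_started.wait, 1.0)
        self.assertTrue(started, "first worker did not start in time")

        task1.cancel()
        task2 = asyncio.create_task(engine._run_blocking_serialized(worker, False))
        await asyncio.sleep(0.05)  # give task2 a chance to race the drain
        release_first.set()

        with self.assertRaises(asyncio.CancelledError):
            await task1
        await task2
        self.assertEqual(max_active, 1, "blocking workers overlapped")
```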



3. Cancelled worker errors hidden 🐞 Bug ✧ Quality
Description
When the request is cancelled, _run_blocking_serialized() waits for the worker thread but
suppresses any worker exception without logging, making MLX/Metal failures during cancellation hard
to diagnose. This reduces observability of precisely the failure mode the PR is trying to mitigate.
Code

vllm_mlx/engine/simple.py[R240-245]

+            except asyncio.CancelledError:
+                try:
+                    await task
+                except Exception:
+                    pass
+                raise
Evidence
In the cancellation handler, exceptions from the worker task are caught with except Exception: and
dropped (pass) without logging, so a worker crash during cancellation becomes invisible in logs
while the caller only sees CancelledError.

vllm_mlx/engine/simple.py[229-245]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
If the worker raises while the outer task is cancelled, the exception is swallowed, harming debuggability.

### Issue Context
The code correctly waits for the worker to finish to keep MLX/Metal serialized, but should still emit diagnostics if the worker failed.

### Fix Focus Areas
- vllm_mlx/engine/simple.py[229-245]

### Implementation direction
In the `except asyncio.CancelledError:` block, change the bare `except Exception: pass` into a log statement (e.g., `logger.warning(..., exc_info=True)`) before re-raising the cancellation.
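In the diff style used above, one way that could look (assuming a module-level logger already exists in simple.py; illustrative, not the merged code):

+            except asyncio.CancelledError:
+                try:
+                    await task
+                except Exception:
+                    # keep the drain, but surface worker failures
+                    logger.warning(
+                        "worker failed during cancellation drain",
+                        exc_info=True,
+                    )
+                raise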




Comment thread on vllm_mlx/engine/simple.py

             cleanup_rope(model)

-            all_resps = await asyncio.to_thread(_run_all)
+            all_resps = await self._run_blocking_serialized(_run_all)

Action required

1. MTP stream deadlock 🐞 Bug ⛯ Reliability

_stream_generate_text() holds _generation_lock and now awaits _run_blocking_serialized(),
which tries to acquire the same lock again, deadlocking MLLM+MTP text-only streaming. Text-only MLLM
requests routed to this path will hang indefinitely.
Agent Prompt
### Issue description
`_stream_generate_text()` currently acquires `self._generation_lock` and then calls `_run_blocking_serialized()`, which also acquires `self._generation_lock`. This deadlocks.

### Issue Context
This path is used for MLLM+MTP routing (text-only requests). The PR moved lock acquisition into `_run_blocking_serialized()`, so call sites that already hold the lock must be adjusted.

### Fix Focus Areas
- vllm_mlx/engine/simple.py[229-246]
- vllm_mlx/engine/simple.py[1012-1014]
- vllm_mlx/engine/simple.py[1194-1201]

### Implementation direction
Refactor `_stream_generate_text()` so `_run_all` (and thus MLX/Metal calls) run under `_run_blocking_serialized()` without an additional surrounding `async with self._generation_lock:` block; or add a dedicated cancellation-safe helper that does **not** reacquire the lock when the caller already holds it.


@krystophny
Collaborator Author

krystophny commented Mar 26, 2026

Benchmark: Simple Engine vs Continuous Batching

Qwen3.5-2B on Apple Silicon (MLX), 128 max_tokens, streaming, 5 iterations per configuration.
Values are mean +/- stdev across iterations.

Aggregate throughput (tok/s) — total completion tokens / wall clock

| Concurrency | Simple Engine | Continuous Batching |
|---|---|---|
| 1 | 120 +/- 44 | 142 +/- 57 |
| 2 | 96 +/- 7 | 144 +/- 11 |
| 4 | 95 +/- 1 | 173 +/- 6 |
| 8 | 95 +/- 1 | 225 +/- 3 |

Wall clock (seconds) — time from first request to last response

| Concurrency | Simple Engine | Continuous Batching |
|---|---|---|
| 1 | 1.21 +/- 0.52 | 1.08 +/- 0.58 |
| 2 | 2.67 +/- 0.19 | 1.79 +/- 0.13 |
| 4 | 5.38 +/- 0.05 | 2.23 +/- 0.08 |
| 8 | 10.79 +/- 0.06 | 3.89 +/- 0.05 |

Per-request throughput (tok/s) — individual request completion tokens / elapsed

| Concurrency | Simple Engine | Continuous Batching |
|---|---|---|
| 1 | 120 +/- 44 | 142 +/- 57 |
| 2 | 68 +/- 23 | 72 +/- 5 |
| 4 | 46 +/- 24 | 47 +/- 19 |
| 8 | 30 +/- 23 | 30 +/- 7 |

TTFT — time to first content token

Measured from request send to first non-empty SSE delta. Includes prompt processing.

| Concurrency | Metric | Simple Engine | Continuous Batching |
|---|---|---|---|
| 1 | min | 1096 +/- 458 ms | 1075 +/- 579 ms |
| 1 | max | 1096 +/- 458 ms | 1075 +/- 579 ms |
| 2 | min | 1196 +/- 220 ms | 1789 +/- 134 ms |
| 2 | max | 2668 +/- 190 ms | 1789 +/- 134 ms |
| 4 | min | 1228 +/- 38 ms | 129 +/- 27 ms |
| 4 | max | 5379 +/- 48 ms | 2231 +/- 82 ms |
| 8 | min | 1392 +/- 193 ms | 181 +/- 17 ms |
| 8 | max | 10787 +/- 57 ms | 3889 +/- 50 ms |

Analysis

Throughput scaling:

  • Simple engine aggregate throughput is flat at ~95 tok/s regardless of concurrency (2-8). Requests serialize behind the generation lock — adding concurrent requests only adds queue wait.
  • Continuous batching scales: 142 -> 144 -> 173 -> 225 tok/s from 1 to 8 concurrent requests. At 8 concurrent, batching delivers 2.4x higher aggregate throughput (225 vs 95 tok/s).

Wall clock:

  • Simple engine wall clock scales linearly: 8 concurrent requests take 10.8s (= 8 x ~1.35s per request, serialized).
  • Continuous batching at 8 concurrent: 3.9s wall clock — 2.8x faster than simple engine.

Per-request throughput:

  • Nearly identical between engines at every concurrency level. Each individual request gets the same bandwidth — the difference is whether others can run concurrently.

TTFT behavior:

  • Simple engine: min TTFT stays ~1.2-1.4s regardless of concurrency (first-in-queue gets served immediately). Max TTFT grows linearly: 2.7s at conc=2, 5.4s at conc=4, 10.8s at conc=8 (last request waits for all preceding ones).
  • Continuous batching: at higher concurrency, min TTFT drops dramatically (129ms at conc=4, 181ms at conc=8) because short-response requests (e.g., "capital of France" -> 2 tokens) finish almost immediately within the batch. Max TTFT is the longest-generating request in the batch (2.2s at conc=4, 3.9s at conc=8).
  • Key difference: simple engine TTFT worst-case is sum of all preceding requests. Batching TTFT worst-case is just the longest individual request in the batch.

Bottom line: the simple engine serialization from this PR is correct behavior for the non-batched code path. Users needing concurrent request handling should use --continuous-batching.

krystophny added a commit that referenced this pull request Apr 14, 2026

* fix: keep simple engine serialized across cancellation (#8)

* fix: avoid nested simple engine generation locks

* fix: catch BaseException in cancellation handler, fix async test markers

_run_blocking_serialized catches CancelledError (a BaseException subclass)
from the outer scope, but the inner try/except used Exception which would
let a second CancelledError during await task escape unhandled. Changed to
BaseException to suppress any exception from the draining await.

Also fix test_simple_engine.py to use pytest.mark.anyio instead of
pytest.mark.asyncio (pytest-asyncio is not configured), and add the
anyio_backend fixture to conftest.py restricting to asyncio only since
trio is not installed.

* fix: preserve prompt token accounting after upstream refresh

* fix: restore specprefill fallback helper scope
